A dataset of 113,937 loans is provided by Prosper. Each loan has 81 variables including loan amount, borrower rate/interest rate, loan status, borrower income and many others.
The goal of this analysis is to answer the following question: What affects the borrowers interest rate?
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
df = pd.read_csv('prosperLoanData.csv')
print(df.shape)
(113937, 81)
df.head()
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
Given the large number of variables, I'll explore each by data type
print(df.dtypes[df.dtypes == 'object'])
print(df.dtypes[df.dtypes == 'float64'])
print(df.dtypes[df.dtypes == 'int64'])
print(df.dtypes[df.dtypes == 'bool'])
ListingKey object ListingCreationDate object CreditGrade object LoanStatus object ClosedDate object ProsperRating (Alpha) object BorrowerState object Occupation object EmploymentStatus object GroupKey object DateCreditPulled object FirstRecordedCreditLine object IncomeRange object LoanKey object LoanOriginationDate object LoanOriginationQuarter object MemberKey object dtype: object BorrowerAPR float64 BorrowerRate float64 LenderYield float64 EstimatedEffectiveYield float64 EstimatedLoss float64 EstimatedReturn float64 ProsperRating (numeric) float64 ProsperScore float64 EmploymentStatusDuration float64 CreditScoreRangeLower float64 CreditScoreRangeUpper float64 CurrentCreditLines float64 OpenCreditLines float64 TotalCreditLinespast7years float64 OpenRevolvingMonthlyPayment float64 InquiriesLast6Months float64 TotalInquiries float64 CurrentDelinquencies float64 AmountDelinquent float64 DelinquenciesLast7Years float64 PublicRecordsLast10Years float64 PublicRecordsLast12Months float64 RevolvingCreditBalance float64 BankcardUtilization float64 AvailableBankcardCredit float64 TotalTrades float64 TradesNeverDelinquent (percentage) float64 TradesOpenedLast6Months float64 DebtToIncomeRatio float64 StatedMonthlyIncome float64 TotalProsperLoans float64 TotalProsperPaymentsBilled float64 OnTimeProsperPayments float64 ProsperPaymentsLessThanOneMonthLate float64 ProsperPaymentsOneMonthPlusLate float64 ProsperPrincipalBorrowed float64 ProsperPrincipalOutstanding float64 ScorexChangeAtTimeOfListing float64 LoanFirstDefaultedCycleNumber float64 MonthlyLoanPayment float64 LP_CustomerPayments float64 LP_CustomerPrincipalPayments float64 LP_InterestandFees float64 LP_ServiceFees float64 LP_CollectionFees float64 LP_GrossPrincipalLoss float64 LP_NetPrincipalLoss float64 LP_NonPrincipalRecoverypayments float64 PercentFunded float64 InvestmentFromFriendsAmount float64 dtype: object ListingNumber int64 Term int64 ListingCategory (numeric) int64 OpenRevolvingAccounts int64 LoanCurrentDaysDelinquent int64 LoanMonthsSinceOrigination int64 LoanNumber int64 LoanOriginalAmount int64 Recommendations int64 InvestmentFromFriendsCount int64 Investors int64 dtype: object IsBorrowerHomeowner bool CurrentlyInGroup bool IncomeVerifiable bool dtype: object
categorical_cols = ['CreditGrade', 'ProsperRating (Alpha)', 'BorrowerState','Occupation','EmploymentStatus','IncomeRange']
numeric_cols = ['BorrowerRate', 'ProsperRating (numeric)', 'ProsperScore', 'EmploymentStatusDuration', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'TotalCreditLinespast7years', 'InquiriesLast6Months', 'DelinquenciesLast7Years', 'PublicRecordsLast12Months', 'DebtToIncomeRatio', 'TotalProsperLoans', 'MonthlyLoanPayment', 'ListingCategory (numeric)', 'LoanOriginalAmount' ]
bool_cols = ['IsBorrowerHomeowner']
all_cols = categorical_cols + numeric_cols + bool_cols
df[all_cols].head(10)
| CreditGrade | ProsperRating (Alpha) | BorrowerState | Occupation | EmploymentStatus | IncomeRange | BorrowerRate | ProsperRating (numeric) | ProsperScore | EmploymentStatusDuration | ... | TotalCreditLinespast7years | InquiriesLast6Months | DelinquenciesLast7Years | PublicRecordsLast12Months | DebtToIncomeRatio | TotalProsperLoans | MonthlyLoanPayment | ListingCategory (numeric) | LoanOriginalAmount | IsBorrowerHomeowner | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C | NaN | CO | Other | Self-employed | $25,000-49,999 | 0.1580 | NaN | NaN | 2.0 | ... | 12.0 | 3.0 | 4.0 | 0.0 | 0.17 | NaN | 330.43 | 0 | 9425 | True |
| 1 | NaN | A | CO | Professional | Employed | $50,000-74,999 | 0.0920 | 6.0 | 7.0 | 44.0 | ... | 29.0 | 3.0 | 0.0 | 0.0 | 0.18 | NaN | 318.93 | 2 | 10000 | False |
| 2 | HR | NaN | GA | Other | Not available | Not displayed | 0.2750 | NaN | NaN | NaN | ... | 3.0 | 0.0 | 0.0 | NaN | 0.06 | NaN | 123.32 | 0 | 3001 | False |
| 3 | NaN | A | GA | Skilled Labor | Employed | $25,000-49,999 | 0.0974 | 6.0 | 9.0 | 113.0 | ... | 29.0 | 0.0 | 14.0 | 0.0 | 0.15 | NaN | 321.45 | 16 | 10000 | True |
| 4 | NaN | D | MN | Executive | Employed | $100,000+ | 0.2085 | 3.0 | 4.0 | 44.0 | ... | 49.0 | 1.0 | 0.0 | 0.0 | 0.26 | 1.0 | 563.97 | 2 | 15000 | True |
| 5 | NaN | B | NM | Professional | Employed | $100,000+ | 0.1314 | 5.0 | 10.0 | 82.0 | ... | 49.0 | 0.0 | 0.0 | 0.0 | 0.36 | NaN | 342.37 | 1 | 15000 | True |
| 6 | NaN | E | KS | Sales - Retail | Employed | $25,000-49,999 | 0.2712 | 2.0 | 2.0 | 172.0 | ... | 20.0 | 0.0 | 0.0 | 0.0 | 0.27 | NaN | 122.67 | 1 | 3000 | False |
| 7 | NaN | C | CA | Laborer | Employed | $25,000-49,999 | 0.2019 | 4.0 | 4.0 | 103.0 | ... | 10.0 | 3.0 | 0.0 | 0.0 | 0.24 | NaN | 372.60 | 2 | 10000 | False |
| 8 | NaN | AA | IL | Food Service | Employed | $25,000-49,999 | 0.0629 | 7.0 | 9.0 | 269.0 | ... | 32.0 | 1.0 | 0.0 | 0.0 | 0.25 | NaN | 305.54 | 7 | 10000 | True |
| 9 | NaN | AA | IL | Food Service | Employed | $25,000-49,999 | 0.0629 | 7.0 | 11.0 | 269.0 | ... | 32.0 | 1.0 | 0.0 | 0.0 | 0.25 | NaN | 305.54 | 7 | 10000 | True |
10 rows × 22 columns
Before working with the data, I'd like to make a few changes so that its easier to work with:
IsNewRating yes/no flag that allows me to easily filter one or the other. I thought about melting the different types of ratings columns, however since there score values might be different in the different time periods, I don't feel this is a good idea.Occupation and EmploymentStatus to nominal categorical variables and IncomeRange, CreditGrade and ProsperRatingAlpha to a ordinal categorical variable.ListingCategory (numeric) is a numeric field, I feel its well suited to being a nominal categorical variable instead, so will convert it.# renaming columns to make them easier to work with
df.rename(columns={'ProsperRating (numeric)': 'ProsperRatingNumeric', 'ProsperRating (Alpha)': 'ProsperRatingAlpha', 'ListingCategory (numeric)':'ListingCategoryNumeric'}, inplace=True)
# convert income range, credit grade and prosper rating alpha into ordered categorical types
ordinal_var_dict = {'IncomeRange': ['Not displayed','Not employed','$0','$1-24,999','$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+'],
'CreditGrade': ['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
'ProsperRatingAlpha': ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']}
for var in ordinal_var_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_var_dict[var])
df[var] = df[var].astype(ordered_var)
# convert loan status, listing category, occupation, employment status into nominal categorical variables
nominal_vars = ['ListingCategoryNumeric', 'Occupation', 'EmploymentStatus']
df[nominal_vars] = df[nominal_vars].astype('category')
# a new rating is a record which has a value for ProsperRatingAlpha (and ProsperRatingNumber/ProsperRatingScore), IsNewRating should be True when ProsperRatingAlpha is not null
df['IsNewRating'] = ~df['ProsperRatingAlpha'].isna()
# put each type of data into its own set of named columns so that its easier to pick these out later
categorical_cols = ['CreditGrade', 'ProsperRatingAlpha', 'BorrowerState','Occupation','EmploymentStatus','IncomeRange']
numeric_cols = ['BorrowerRate', 'ProsperRatingNumeric', 'ProsperScore', 'EmploymentStatusDuration', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'TotalCreditLinespast7years', 'InquiriesLast6Months', 'DelinquenciesLast7Years', 'PublicRecordsLast12Months', 'DebtToIncomeRatio', 'TotalProsperLoans', 'MonthlyLoanPayment', 'ListingCategoryNumeric', 'LoanOriginalAmount' ]
bool_cols = ['IsBorrowerHomeowner', 'IsNewRating']
all_cols = categorical_cols + numeric_cols + bool_cols
df_trimmed = df[all_cols]
df_trimmed.head(10)
| CreditGrade | ProsperRatingAlpha | BorrowerState | Occupation | EmploymentStatus | IncomeRange | BorrowerRate | ProsperRatingNumeric | ProsperScore | EmploymentStatusDuration | ... | InquiriesLast6Months | DelinquenciesLast7Years | PublicRecordsLast12Months | DebtToIncomeRatio | TotalProsperLoans | MonthlyLoanPayment | ListingCategoryNumeric | LoanOriginalAmount | IsBorrowerHomeowner | IsNewRating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C | NaN | CO | Other | Self-employed | $25,000-49,999 | 0.1580 | NaN | NaN | 2.0 | ... | 3.0 | 4.0 | 0.0 | 0.17 | NaN | 330.43 | 0 | 9425 | True | False |
| 1 | NaN | A | CO | Professional | Employed | $50,000-74,999 | 0.0920 | 6.0 | 7.0 | 44.0 | ... | 3.0 | 0.0 | 0.0 | 0.18 | NaN | 318.93 | 2 | 10000 | False | True |
| 2 | HR | NaN | GA | Other | Not available | Not displayed | 0.2750 | NaN | NaN | NaN | ... | 0.0 | 0.0 | NaN | 0.06 | NaN | 123.32 | 0 | 3001 | False | False |
| 3 | NaN | A | GA | Skilled Labor | Employed | $25,000-49,999 | 0.0974 | 6.0 | 9.0 | 113.0 | ... | 0.0 | 14.0 | 0.0 | 0.15 | NaN | 321.45 | 16 | 10000 | True | True |
| 4 | NaN | D | MN | Executive | Employed | $100,000+ | 0.2085 | 3.0 | 4.0 | 44.0 | ... | 1.0 | 0.0 | 0.0 | 0.26 | 1.0 | 563.97 | 2 | 15000 | True | True |
| 5 | NaN | B | NM | Professional | Employed | $100,000+ | 0.1314 | 5.0 | 10.0 | 82.0 | ... | 0.0 | 0.0 | 0.0 | 0.36 | NaN | 342.37 | 1 | 15000 | True | True |
| 6 | NaN | E | KS | Sales - Retail | Employed | $25,000-49,999 | 0.2712 | 2.0 | 2.0 | 172.0 | ... | 0.0 | 0.0 | 0.0 | 0.27 | NaN | 122.67 | 1 | 3000 | False | True |
| 7 | NaN | C | CA | Laborer | Employed | $25,000-49,999 | 0.2019 | 4.0 | 4.0 | 103.0 | ... | 3.0 | 0.0 | 0.0 | 0.24 | NaN | 372.60 | 2 | 10000 | False | True |
| 8 | NaN | AA | IL | Food Service | Employed | $25,000-49,999 | 0.0629 | 7.0 | 9.0 | 269.0 | ... | 1.0 | 0.0 | 0.0 | 0.25 | NaN | 305.54 | 7 | 10000 | True | True |
| 9 | NaN | AA | IL | Food Service | Employed | $25,000-49,999 | 0.0629 | 7.0 | 11.0 | 269.0 | ... | 1.0 | 0.0 | 0.0 | 0.25 | NaN | 305.54 | 7 | 10000 | True | True |
10 rows × 23 columns
df_trimmed.isnull().sum()
CreditGrade 84984 ProsperRatingAlpha 29084 BorrowerState 5515 Occupation 3588 EmploymentStatus 2255 IncomeRange 0 BorrowerRate 0 ProsperRatingNumeric 29084 ProsperScore 29084 EmploymentStatusDuration 7625 CreditScoreRangeLower 591 CreditScoreRangeUpper 591 TotalCreditLinespast7years 697 InquiriesLast6Months 697 DelinquenciesLast7Years 990 PublicRecordsLast12Months 7604 DebtToIncomeRatio 8554 TotalProsperLoans 91852 MonthlyLoanPayment 0 ListingCategoryNumeric 0 LoanOriginalAmount 0 IsBorrowerHomeowner 0 IsNewRating 0 dtype: int64
df_trimmed.describe()
| BorrowerRate | ProsperRatingNumeric | ProsperScore | EmploymentStatusDuration | CreditScoreRangeLower | CreditScoreRangeUpper | TotalCreditLinespast7years | InquiriesLast6Months | DelinquenciesLast7Years | PublicRecordsLast12Months | DebtToIncomeRatio | TotalProsperLoans | MonthlyLoanPayment | LoanOriginalAmount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 113937.000000 | 84853.000000 | 84853.000000 | 106312.000000 | 113346.000000 | 113346.000000 | 113240.000000 | 113240.000000 | 112947.000000 | 106333.000000 | 105383.000000 | 22085.000000 | 113937.000000 | 113937.00000 |
| mean | 0.192764 | 4.072243 | 5.950067 | 96.071582 | 685.567731 | 704.567731 | 26.754539 | 1.435085 | 4.154984 | 0.015094 | 0.275947 | 1.421100 | 272.475783 | 8337.01385 |
| std | 0.074818 | 1.673227 | 2.376501 | 94.480605 | 66.458275 | 66.458275 | 13.637871 | 2.437507 | 10.160216 | 0.154092 | 0.551759 | 0.764042 | 192.697812 | 6245.80058 |
| min | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 19.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1000.00000 |
| 25% | 0.134000 | 3.000000 | 4.000000 | 26.000000 | 660.000000 | 679.000000 | 17.000000 | 0.000000 | 0.000000 | 0.000000 | 0.140000 | 1.000000 | 131.620000 | 4000.00000 |
| 50% | 0.184000 | 4.000000 | 6.000000 | 67.000000 | 680.000000 | 699.000000 | 25.000000 | 1.000000 | 0.000000 | 0.000000 | 0.220000 | 1.000000 | 217.740000 | 6500.00000 |
| 75% | 0.250000 | 5.000000 | 8.000000 | 137.000000 | 720.000000 | 739.000000 | 35.000000 | 2.000000 | 3.000000 | 0.000000 | 0.320000 | 2.000000 | 371.580000 | 12000.00000 |
| max | 0.497500 | 7.000000 | 11.000000 | 755.000000 | 880.000000 | 899.000000 | 136.000000 | 105.000000 | 99.000000 | 20.000000 | 10.010000 | 8.000000 | 2251.510000 | 35000.00000 |
There are 113937 loans in the dataset with 81 features. Most variables are numeric, however CreditGrade, ProsperRating (Alpha) and IncomeRange are ordinal while LoanStatus, Occupation and EmploymentStatus are nominal categorical variables. CreditGrade is the rating of the customer pre-July 2099 and ProsperRating (Alpha) (along with PropserRating (numeric) and ProsperScore) is the rating post-July 2009.
I'm primarily interested in the understanding which variables affect the interest rate offered to the borrower.
There are a lot of possibilities with this data set, however I will focus on the credit grade/prosper rating, occupation, employment state, income range, number of credit lines ov erthe last 7 years, number of inquiries over last 6 months, delinqiencies over last 7 years, public records over last 12 months, debt to income ratio, total number of propser loans, monthly loan payment, listing category and the original loan amount requested. I feel these variables are likely to be the primary factors in deciding on a interest rate offered to the customer.
At a high-level, I'm putting down some ideas for later explorations, across univariate, bivariat and multivariate analysis:
BorrowerRate, Occupation, EmploymentStatus, IncomeRange, CreditGrade, ProsperScore, DebtToIncomeRatio, LoanOriginalAmount, IsBorrowerHomeowner and IsNewRatingBorrowerRate by LoanOriginalAmount and BorrowerRate and ProsperScoreBorrowerRate across different levels of IncomeRange, EmploymentStatus and OccupationBorrowerRate vs. LoanOriginalAmount, with an additional hue variable of DebtToIncome ratioTotalCreditLinespast7years, InquiriesLast6Months, DelinquenciesLast7Years, PublicRecordsLast12Months and BorroweRate. First let's investigate our primary variable of interest, the interest rate
# start with a standard-scaled plot
plt.figure(figsize=[8, 5])
plt.hist(data = df_trimmed, x = 'BorrowerRate')
plt.xlabel('Interest Rate (%)')
plt.show()
The distribution is slightly right-skewed, with the majority of interest rates being 0.1 and 0.2%. Let's investigate further
# start with a standard-scaled plot
binsize = 0.01
bins = np.arange(0, df['BorrowerRate'].max()+binsize, binsize)
plt.figure(figsize=[12, 8])
plt.hist(data = df_trimmed, x = 'BorrowerRate', bins=bins)
plt.xlabel('Rate')
plt.title("Borrower Rate Disbtribution", fontsize=15);
plt.show()
df_trimmed['BorrowerRate'].value_counts().sort_values(ascending=False)
0.3177 3672
0.3500 1905
0.3199 1651
0.2900 1508
0.2699 1319
...
0.1526 1
0.2960 1
0.0627 1
0.3478 1
0.0739 1
Name: BorrowerRate, Length: 2294, dtype: int64
There is a large spike at 0.3177, might be interested to see whats common amongst these specific applications later.
plt.figure(figsize=(12,8))
ax = df_trimmed[df_trimmed.BorrowerRate == 0.3177].ProsperScore.hist()
ax.set_title('Prosper Score for loans with most common interest rate of 0.3177')
ax.set_xlabel('Prosper Score');
Although all these applications have the same interest rate, they have a reasonably wide range of Prosper scores assigned to them, although 4 and 5 are the most common. This means that prosper score alone is not the sole factor dictating the interest rate offered.
Lets get a high-level glimpse of the distribution of our variables, before drilling into the details
default_color = sb.color_palette()[0]
plt.figure(figsize=(30,15))
for index, col in enumerate(categorical_cols):
plt.subplot(2, 3, index+1)
sb.countplot(df[col], color = default_color).set_title(f'Distribution of {col}');
C:\CondaEnvs\ds\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn( C:\CondaEnvs\ds\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn(
Under the old scoring system, we can see that scores C and D were most common, while C and B are most common under the new system. Its hard to read some of the other graphs as there is too much info, so let's investigate them individually.
fig, ax = plt.subplots(ncols=2, figsize = [30,10])
default_color = sb.color_palette()[0]
sb.countplot(data = df_trimmed, x = 'EmploymentStatus', color = default_color, ax = ax[0]).set_title('Distribution of Employment Status')
sb.countplot(data = df_trimmed, x = 'IncomeRange', color = default_color, ax = ax[1]).set_title('Distribution of Income Range')
plt.show()
# Find the 10 top states and plot them as a bar chart
plot_data = df_trimmed.groupby('BorrowerState', as_index=False).count().sort_values('CreditGrade', ascending=False)[['BorrowerState','CreditGrade']].rename(columns={'CreditGrade':'count'}).head(10)
plt.figure(figsize=[12, 8])
default_color = sb.color_palette()[0]
sb.barplot(x="BorrowerState", y="count", data=plot_data, color = default_color)
plt.title('Borrower State')
plt.show()
df_trimmed.Occupation.value_counts().sort_values(ascending=False)[:10].plot.barh(figsize=(15,8), title="Top 10 Professions")
<AxesSubplot:title={'center':'Top 10 Professions'}>
df_trimmed.Occupation.value_counts().sort_values(ascending=True)[:10].plot.barh(figsize=(15,8), title="Bottom 10 Professions")
<AxesSubplot:title={'center':'Bottom 10 Professions'}>
Now we can see that most people are Employed (and likely Full-time employed) and least are not employed or retired. Most common income ranges are \$25K - \\$50K and \$50K - \\$75K and CA is the most common state. Other and Professional are the most common, which are likely just "catch-all" categories, this doesnt give us much info. Interestingly though, Computer Programmers are the most common "non-generic" listed profession. Most types of students form the bottom 10 listed professions, as one might expect.
Lets have a quick review of the numerical and boolean variables, in order to check for outliers
plt.figure(figsize=(30,30))
for index, col in enumerate(numeric_cols):
plt.subplot(5, 3, index+1)
sb.distplot(df[col], kde=False);
C:\CondaEnvs\ds\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
Initial insights:
Many of the variables above seem to have outliers, based on the heavily right-skewed distribution. Let's see if we can study them
cols = ['InquiriesLast6Months','DelinquenciesLast7Years','PublicRecordsLast12Months','DebtToIncomeRatio', 'MonthlyLoanPayment', 'LoanOriginalAmount']
df[cols].describe()
| InquiriesLast6Months | DelinquenciesLast7Years | PublicRecordsLast12Months | DebtToIncomeRatio | MonthlyLoanPayment | LoanOriginalAmount | |
|---|---|---|---|---|---|---|
| count | 113240.000000 | 112947.000000 | 106333.000000 | 105383.000000 | 113937.000000 | 113937.00000 |
| mean | 1.435085 | 4.154984 | 0.015094 | 0.275947 | 272.475783 | 8337.01385 |
| std | 2.437507 | 10.160216 | 0.154092 | 0.551759 | 192.697812 | 6245.80058 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1000.00000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.140000 | 131.620000 | 4000.00000 |
| 50% | 1.000000 | 0.000000 | 0.000000 | 0.220000 | 217.740000 | 6500.00000 |
| 75% | 2.000000 | 3.000000 | 0.000000 | 0.320000 | 371.580000 | 12000.00000 |
| max | 105.000000 | 99.000000 | 20.000000 | 10.010000 | 2251.510000 | 35000.00000 |
Let's isolate the outliers and study them
high_outliers = ((df_trimmed['InquiriesLast6Months'] > 10) | (df_trimmed['DelinquenciesLast7Years'] > 10) | (df_trimmed['PublicRecordsLast12Months'] > 3) | (df_trimmed['DebtToIncomeRatio'] > 2) | (df_trimmed['MonthlyLoanPayment'] > 1000) | (df_trimmed['LoanOriginalAmount'] > 20000))
print(high_outliers.sum())
df_trimmed.loc[high_outliers, :]
21314
| CreditGrade | ProsperRatingAlpha | BorrowerState | Occupation | EmploymentStatus | IncomeRange | BorrowerRate | ProsperRatingNumeric | ProsperScore | EmploymentStatusDuration | ... | InquiriesLast6Months | DelinquenciesLast7Years | PublicRecordsLast12Months | DebtToIncomeRatio | TotalProsperLoans | MonthlyLoanPayment | ListingCategoryNumeric | LoanOriginalAmount | IsBorrowerHomeowner | IsNewRating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | NaN | A | GA | Skilled Labor | Employed | $25,000-49,999 | 0.0974 | 6.0 | 9.0 | 113.0 | ... | 0.0 | 14.0 | 0.0 | 0.15 | NaN | 321.45 | 16 | 10000 | True | True |
| 10 | NaN | C | MD | Fireman | Employed | $75,000-99,999 | 0.2489 | 4.0 | 7.0 | 300.0 | ... | 1.0 | 28.0 | 0.0 | 0.12 | NaN | 395.37 | 1 | 13500 | False | True |
| 16 | NaN | E | CA | Professional | Full-time | $75,000-99,999 | 0.2809 | 2.0 | 4.0 | 133.0 | ... | 1.0 | 12.0 | 0.0 | 0.15 | 3.0 | 124.76 | 6 | 4000 | True | True |
| 22 | NaN | HR | NY | Other | Other | $1-24,999 | 0.3134 | 1.0 | 2.0 | 68.0 | ... | 0.0 | 0.0 | 0.0 | 9.20 | NaN | 172.76 | 1 | 4000 | False | True |
| 29 | NaN | A | LA | Other | Employed | $100,000+ | 0.1239 | 6.0 | 6.0 | 16.0 | ... | 2.0 | 0.0 | 0.0 | 0.32 | NaN | 1169.03 | 1 | 35000 | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113911 | NaN | D | MD | Other | Employed | $25,000-49,999 | 0.2155 | 3.0 | 3.0 | 51.0 | ... | 3.0 | 48.0 | 0.0 | 0.29 | NaN | 379.58 | 14 | 10000 | False | True |
| 113914 | HR | NaN | CA | NaN | NaN | Not displayed | 0.2500 | NaN | NaN | NaN | ... | 9.0 | 16.0 | NaN | 0.05 | NaN | 119.28 | 0 | 3000 | False | False |
| 113916 | NaN | E | FL | Professional | Employed | $50,000-74,999 | 0.2932 | 2.0 | 6.0 | 149.0 | ... | 1.0 | 18.0 | 0.0 | 0.15 | NaN | 168.32 | 13 | 4000 | False | True |
| 113926 | E | NaN | CA | Other | Not available | Not displayed | 0.2900 | NaN | NaN | NaN | ... | 2.0 | 18.0 | NaN | 0.07 | NaN | 125.72 | 0 | 3000 | False | False |
| 113931 | NaN | B | IL | Analyst | Employed | $75,000-99,999 | 0.1274 | 5.0 | 6.0 | 12.0 | ... | 2.0 | 0.0 | 0.0 | 0.28 | NaN | 565.50 | 3 | 25000 | False | True |
21314 rows × 23 columns
Its not clear that there is anything wrong with these loans, however it will be good to view the effect on the distribution once they are removed.
df = df_trimmed.loc[-high_outliers,:]
plt.figure(figsize=(30,30))
for index, col in enumerate(numeric_cols):
plt.subplot(5, 3, index+1)
sb.distplot(df[col], kde=False);
C:\CondaEnvs\ds\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
Now its a little more clear that:
The distribution is slightly right-skewed, with the majority of interest rates being 0.1 and 0.2%. There is a large spike at 0.3177. Although all these applications have the same interest rate, they have a reasonably wide range of Prosper scores assigned to them, although 4 and 5 are the most common. This means that prosper score alone is not the sole factor dictating the interest rate offered. The variables wasn't transformed however a more suitable bin size was chosen to get a more detailed view of the variable.
CreditGrade is used prior to July 2009, while ProsperRating (alpha and numeric)/ProsperScore are used from July 2009. A new IsNewRating yes/no flag was added that allows me to easily filter one or the other. I thought about melting the different types of ratings columns, however since there score values might be different in the different time periods, I don't feel this is a good idea.
Decided to convert Occupation and EmploymentStatus to nominal categorical variables and IncomeRange, CreditGrade and ProsperRatingAlpha to a ordinal categorical variable.
Although ListingCategory (numeric) is a numeric field, I feel its well suited to being a nominal categorical variable instead, so it was converted as well. A few columns were also renamed to make them easier to work with.
In terms of the distribution study, we determined:
InquiriesLast6Months,DelinquenciesLast7Years,PublicRecordsLast12Months,DebtToIncomeRatio, MonthlyLoanPayment, LoanOriginalAmountdfcategorical_cols = ['CreditGrade', 'ProsperRatingAlpha', 'BorrowerState','Occupation','EmploymentStatus','IncomeRange']
numeric_cols = ['BorrowerRate', 'ProsperRatingNumeric', 'ProsperScore', 'EmploymentStatusDuration', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'TotalCreditLinespast7years', 'InquiriesLast6Months', 'DelinquenciesLast7Years', 'PublicRecordsLast12Months', 'DebtToIncomeRatio', 'TotalProsperLoans', 'MonthlyLoanPayment', 'ListingCategoryNumeric', 'LoanOriginalAmount' ]
bool_cols = ['IsBorrowerHomeowner', 'IsNewRating']
all_cols = categorical_cols + numeric_cols + bool_cols
Let's have a look at pair-wise correlations between our numerical variables.
# correlation plot
plt.figure(figsize = [15, 10])
sb.heatmap(df[numeric_cols].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.show()
We see some postiive correlations between monthly loan payment and loan original amount, as one would expect and negative correlations between borrower rate and prosperscore, i.e. the worse the score, the higher the rate. This doesn't give us too much info though as the prosper score is likely a function of all the other variables. As such, its an expoected correlation which doesn't describe why the borrower rate is what it is.
Credit score has a medium strong negative correlation with borrower rate, i.e. the lower the credit score, the higher the borrower rate. Many of the other numeric variable have a weak linear correlation with borrower rate, but its possible that their relationship is non-linear with borrower rate.
Let's try bring some of these non-linearities out now
# plot matrix: sample 500 diamonds so that plots are clearer and they render faster
print("df.shape=",df.shape)
df_samp = df.sample(n=500, replace = False)
print("df_samp.shape=",df_samp.shape)
g = sb.PairGrid(data = df_samp, vars = numeric_cols)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter)
df.shape= (92623, 23) df_samp.shape= (500, 23)
<seaborn.axisgrid.PairGrid at 0x28978cdee80>
It seems like there is positive correlation between the credit score and the loan amount offered, but otherwise I don't feel there is much to be gained from this plot. Let's take a look at the relationship between borrower rate and some of the categorical variables
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.
# Deprecated
# samples = np.random.choice(diamonds.shape[0], 2000, replace = False)
# diamonds_samp = diamonds.loc[samples,:]
df_samp = df.sample(n=2000, replace = False)
categorical_cols1 = ['CreditGrade', 'ProsperRatingAlpha']
categorical_cols2 = ['EmploymentStatus','IncomeRange']
def boxgrid(x, y, **kwargs):
""" Quick hack for creating box plots with seaborn's PairGrid. """
default_color = sb.color_palette()[0]
sb.boxplot(x=x, y=y, color=default_color)
#plt.figure(figsize = [15, 20])
g = sb.PairGrid(data = df_samp, y_vars = ['BorrowerRate'], x_vars = categorical_cols1,
height = 5, aspect = 1.5)
g.map(boxgrid)
plt.show();
#plt.figure(figsize = [15, 20])
g = sb.PairGrid(data = df_samp, y_vars = ['BorrowerRate'], x_vars = categorical_cols2,
height = 5, aspect = 1.5)
g.map(boxgrid)
for ax in g.axes.flatten():
ax.tick_params(rotation = 45)
plt.show();
Lets have a look at the relationship between some of the categorical variables
categorical_cols
['CreditGrade', 'ProsperRatingAlpha', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'IncomeRange']
# since there's only four subplots to create, using the full data should be fine.
plt.figure(figsize = [15, 12])
# subplot 1: EmploymentStatus vs IncomeRange
plt.subplot(3, 1, 1)
sb.countplot(data = df, x = 'EmploymentStatus', hue = 'IncomeRange', palette = 'Blues')
# subplot 2: EmploymentStatus vs. ProsperRatingAlpha
ax = plt.subplot(3, 1, 2)
sb.countplot(data = df, x = 'EmploymentStatus', hue = 'ProsperRatingAlpha', palette = 'Blues')
#ax.legend(ncol = 2) # re-arrange legend to reduce overlapping
# subplot 3: IncomeRange vs. ProsperRatingAlpha, use different color palette
ax = plt.subplot(3, 1, 3)
sb.countplot(data = df, x = 'IncomeRange', hue = 'ProsperRatingAlpha', palette = 'Greens')
#ax.legend(loc = 1, ncol = 2) # re-arrange legend to remove overlapping
plt.show()
Correlates with what we saw earlier, i.e. highest earners will tend to be full-time employees, D ProsperRating is most common for those in the 25K - 50L range and C in the 50K - 75K range
Let's dive into the relationship between some of our other variables
cols = ['BorrowerRate','DebtToIncomeRatio','TotalCreditLinespast7years','DelinquenciesLast7Years', 'PublicRecordsLast12Months', 'IsBorrowerHomeowner']
g = sb.PairGrid(df[cols], hue="IsBorrowerHomeowner")
g.map_diag(sb.histplot)
g.map_offdiag(sb.scatterplot)
g.add_legend(title="Home Owner", adjust_subtitles=True)
<seaborn.axisgrid.PairGrid at 0x28906423cd0>
Interestingly, home owners tend to have a lower debit to income ratio while having more credit lines over the last 7 years. Likewise, being a home owner tends to attract a lower borrower rate.
Lets delve further into the relationship between borrower rate, home ownership and debt to income ratio
ax = sb.pairplot(df[['BorrowerRate', 'IsBorrowerHomeowner', 'DebtToIncomeRatio']], hue="IsBorrowerHomeowner", height=5, aspect=1.5)
ax.fig.subplots_adjust(top=.95)
ax.fig.suptitle('Relationship between Borrower Rate, Home Ownership and Debt to Income Ratio');
The distribution for DebtToIncome is quite similar whether a home owner or not, while being a home owner does attract a slightly better borrower rate, as we already observed above.
Let's see if we can spot a relationship between employee status duration and borrower rate
df['EmploymentStatusDuration'].value_counts()
0.0 1286
4.0 983
3.0 977
2.0 959
5.0 957
...
567.0 1
662.0 1
733.0 1
690.0 1
581.0 1
Name: EmploymentStatusDuration, Length: 591, dtype: int64
g = sb.PairGrid(df[['EmploymentStatusDuration', 'BorrowerRate']], height=5, aspect=1.5)
g.map_diag(sb.histplot)
g.map_offdiag(sb.scatterplot)
g.add_legend()
<seaborn.axisgrid.PairGrid at 0x289061417c0>
Doesnt seem like there is much insight to be gained here.
I'd like to explore the relationship between Borrower Rate, Home Ownership, Credit Rating and Income Range.
with sb.axes_style("white"):
g = sb.FacetGrid(df, row="IsBorrowerHomeowner", col="IncomeRange", margin_titles=True, height=2.5)
g.map(sb.scatterplot, "CreditScoreRangeUpper", "BorrowerRate", color="#334488")
g.figure.subplots_adjust(wspace=.02, hspace=.02)
Its hard to make an inferences from this since the upper end of the credit range seems to be quite similar across the board
We know that the Credit Grade/Prosper Score correlates with the borrower rate, but its not clear what contributes to that credit grade or prosper score, so let's unpack that a little
g = sb.FacetGrid(df, col="ProsperRatingAlpha")
g.map_dataframe(sb.scatterplot, x="LoanOriginalAmount", y="BorrowerRate", hue="IsBorrowerHomeowner")
g.add_legend()
<seaborn.axisgrid.FacetGrid at 0x28922436400>
We can see from this that ProsperRatingAlpha is the clearest indiator of the borrower rate, so we need to fully understand what leads to this prosper rating. We know from earlier that Employment Status and Income Range are important factors, so let's explore their relationship to Prosper Rating
df.IncomeRange.values
['$25,000-49,999', '$50,000-74,999', 'Not displayed', '$100,000+', '$100,000+', ..., '$50,000-74,999', '$75,000-99,999', '$25,000-49,999', '$25,000-49,999', '$50,000-74,999'] Length: 92623 Categories (8, object): ['Not displayed' < 'Not employed' < '$0' < '$1-24,999' < '$25,000-49,999' < '$50,000-74,999' < '$75,000-99,999' < '$100,000+']
g = sb.FacetGrid(df, col="ProsperRatingAlpha", height=6, aspect=.8)
g.map(sb.barplot, "IncomeRange", "BorrowerRate", order=["Not displayed", "Not employed", "$0", "$1-24,999", "$25,000-49,999", "$50,000-74,999", "$75,000-99,999", "$100,000+"])
for ax in g.axes.flatten():
ax.tick_params(rotation = 45)
plt.show();
For each prosper rating, we see almost equal contributions for all the ranges of income.
g = sb.FacetGrid(df, col="ProsperRatingAlpha", height=6, aspect=.8)
g.map(sb.barplot, "IsBorrowerHomeowner", "BorrowerRate");
C:\CondaEnvs\ds\lib\site-packages\seaborn\axisgrid.py:670: UserWarning: Using the barplot function without specifying `order` is likely to produce an incorrect plot. warnings.warn(warning)
For each prosper rating, we see almost equal contributions whether the person is a home owner or not
g = sb.FacetGrid(df, col="ProsperRatingAlpha", height=6, aspect=.8)
g.map(sb.barplot, "TotalProsperLoans", "BorrowerRate");
C:\CondaEnvs\ds\lib\site-packages\seaborn\axisgrid.py:670: UserWarning: Using the barplot function without specifying `order` is likely to produce an incorrect plot. warnings.warn(warning)
g = sb.FacetGrid(df, col="ProsperRatingAlpha", height=6, aspect=.8)
g.map(sb.barplot, "EmploymentStatus", "BorrowerRate")
for ax in g.axes.flatten():
ax.tick_params(rotation = 45)
plt.show();
C:\CondaEnvs\ds\lib\site-packages\seaborn\axisgrid.py:670: UserWarning: Using the barplot function without specifying `order` is likely to produce an incorrect plot. warnings.warn(warning)
Lets see how credit score varies across income range and prosper rating
with sb.axes_style("white"):
g = sb.FacetGrid(df, col="ProsperRatingAlpha", row="IncomeRange", margin_titles=True, height=2.5)
g.map(sb.scatterplot, "CreditScoreRangeUpper", "BorrowerRate", color="#334488")
#g.set_axis_labels("Total bill (US Dollars)", "Tip")
#g.set(xticks=[10, 30, 50], yticks=[2, 6, 10])
g.figure.subplots_adjust(wspace=.02, hspace=.02)
It is clear the the Prosper Rating given to a loan is the determining factor in what dictates the borrower rate, however I wasn't able to clearly see strong correlations with what led to the specific Prosper Rating. Credit rating, home ownership, income range, employment status all contribute somewhat, but a deeper analysis is required to further understand these and other relationships not investigated.
I found it interedting that although income range does contribute to the prosper rating and borrower rate, you still see borrower rates from low to high with a mix of income ranges, indicating that there are other factors at play.
From our analysis we can determine the following: